ABSTRACT

This paper focuses on language modelling for
task-oriented domains and presents a detailed analysis
of the utterances acquired by the Dialogos spoken
dialogue system. Dialogos provides access to the
Italian Railways timetable by telephone over the
public network.

Two aspects of language modelling are studied:
specificity and behaviour with respect to rare events.
A technique for making a language model more robust,
based on sentences generated by grammars, is presented.
Experimental results show the benefit of the proposed
technique. The performance gain of language models
created using grammars over conventional ones is higher
when the amount of training material is limited. The
technique is therefore especially advantageous when
developing language models for a new domain.

1. INTRODUCTION 

Statistical language modelling (LM) is currently used
for two different classes of applications: dictation
systems and task-oriented spoken dialogue systems (SDS).

Systems of the first kind are tested with a very
large vocabulary (60-20,000 words) and need a huge
amount of training data; for instance, WSJ-NAB has a
45-million-word text corpus.

SDSs are used in specific task-oriented domains
and need dedicated training material, which can be
obtained either through expensive simulations or by
using the SDS itself. Compared with a LM trained on a
task-dependent corpus, a LM for a SDS trained on a
general task-independent corpus can have a perplexity
an order of magnitude higher. This is due to the
mismatch between the general corpus and the specific
application domain. In any case, the acquired material
is very limited; for instance, the LM in the Air Travel
Information System (ATIS) is based on a training-set
of only 250,000 words.

This paper focuses on language modelling for
task-oriented domains. The tests use the utterances
acquired by Dialogos, the SDS which provides access to
the Italian Railways timetable by telephone over the
public network. Other similar systems have been
described in the literature. The vocabulary of Dialogos
contains 3,471 words, clustered in 358 classes. The
semantically important words are grouped into classes,
such as city names (2,983 words), numbers (76 words),
and so on. During recognition, a class-based bigram LM
is used, and the 25-best sequences are rescored using a
trigram LM.

Section 2 shows how well a LM captures the
specificity of the domain, while Section 3 studies the
behaviour of the LM with respect to rare events.
Finally, Section 4 illustrates a technique for
generalising a LM by adding n-grams generated by a
grammar.

2. SPECIFICITY OF A LANGUAGE MODEL

A relevant characteristic of a task-oriented domain is
the distribution of the user utterances in a corpus.
Using the Dialogos SDS, a corpus of 1,363 spoken
dialogues was acquired from 493 inexperienced subjects,
who called the system from all over Italy.

For the present study, the collected material was
divided into two parts: a training-set of 20,511
utterances and a test-set of 2,040 utterances. Each
utterance was transformed into a normalised form (NU)
by replacing each city name, month name and number
with a class tag. For instance, the user utterance:

"I want to leave from Naples to Rome Monday at
five (o'clock)"

becomes the following NU: 

"I want to leave from CITY-NAME to CITY-NAME 
WEEK-DAY at HOUR-NUMBER". 
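
The normalisation step can be illustrated with a minimal sketch. The class lexicon below is hypothetical and tiny (the real Dialogos classes are not listed here), and mapping every listed number word to HOUR-NUMBER regardless of context is a simplifying assumption, not the procedure actually used to build the corpus.

```python
import re

# Hypothetical class lexicon, invented for illustration; the real
# Dialogos classes are far larger (e.g. 2,983 city names).
CLASS_LEXICON = {
    "CITY-NAME": {"naples", "rome", "milan", "turin"},
    "WEEK-DAY": {"monday", "tuesday", "wednesday", "thursday",
                 "friday", "saturday", "sunday"},
    # Assumption: number words are always hours in this toy example.
    "HOUR-NUMBER": {"one", "two", "three", "four", "five", "six",
                    "seven", "eight", "nine", "ten", "eleven", "twelve"},
}

# Invert the lexicon into a word -> class-tag lookup table.
WORD_TO_TAG = {word: tag
               for tag, words in CLASS_LEXICON.items()
               for word in words}


def normalise(utterance: str) -> str:
    """Replace every word that belongs to a class with its class tag."""
    tokens = re.findall(r"[A-Za-z']+", utterance)
    return " ".join(WORD_TO_TAG.get(tok.lower(), tok) for tok in tokens)


nu = normalise("I want to leave from Naples to Rome Monday at five")
print(nu)
# I want to leave from CITY-NAME to CITY-NAME WEEK-DAY at HOUR-NUMBER
```

Because many different utterances collapse onto the same NU, counting NUs rather than raw utterances is what makes the coverage figures discussed in this section meaningful.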

For the purposes of language modelling, the NU is
equivalent to the original utterance, because these
classes are used by the class-based LM and each word
in a class is considered to have equal probability.

It is worth noting that even a small number of
very frequent NUs covers a large part of the acquired
data (see Figure 1). The 7 most frequent NUs cover
58% of the training-set and 54% of the test-set, and
the 191 most frequent cover nearly 80% of the test-set
and over 85% of the training-set. On the other hand,
2,060 NUs occur just once, and more than 56% of them
contain spontaneous speech phenomena. This result
shows that a few frequent NUs already give a fairly
accurate picture of the distribution of user
utterances.

Figure 1: Coverage of training and test sets by the
NUs.
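
A coverage figure of this kind can be computed as in the following sketch. The function names and the toy corpus are assumptions made for illustration; they are not taken from the tools used in the study.

```python
from collections import Counter


def coverage_of_top_k(nus: list[str], k: int) -> float:
    """Fraction of utterances covered by the k most frequent NUs."""
    counts = Counter(nus)
    covered = sum(count for _, count in counts.most_common(k))
    return covered / len(nus)


def singletons(nus: list[str]) -> int:
    """Number of distinct NUs that occur exactly once."""
    return sum(1 for count in Counter(nus).values() if count == 1)


# Toy corpus of already-normalised utterances (invented).
toy_nus = ["yes"] * 5 + ["no"] * 3 + ["from CITY-NAME", "to CITY-NAME"]
print(coverage_of_top_k(toy_nus, 2))  # 0.8
print(singletons(toy_nus))            # 2
```

To measure the coverage of a test-set by the most frequent training NUs, the same idea applies, with the top-k list taken from the training-set and the fraction counted over the test-set.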

In addition, partial training-sets were selected,
each containing the first n utterances of the whole
training-set, for n ranging from 100 to 20,511. A LM
was created for each partial training-set; the
recognition (WA) and understanding (SU) rates are
given in Figure 2. The performance of the LMs created
on the partial training-sets was compared with an
experiment without any LM, which is also reported in
Figure 2 as the 0-utterance training-set. A LM trained
on only 100 utterances achieves a remarkable error
rate reduction of 30% for SU and 23% for WA,
especially when compared with the error reduction
obtained with the whole 20,511-utterance training-set,
which is 43% for SU and 39% for WA.
Figure 2: Variation of performance with size of
training data.

Consistent behaviour is also confirmed by the
perplexity (PP) values shown in Figure 3, where the
utterances are classified according to the kind of
prompt generated by the system. Three representative
kinds of request were selected: departure and arrival
city (City), time of departure (Time), and date of
departure (Date). For these categories, the PP of a
100-utterance LM is twice that of a 1,000-utterance LM
and three times that of the LM trained on the whole
training-set. The fact that the PP values for the City
requests are the highest can be explained by the large
number of city names in the vocabulary (2,983, about
85% of the whole vocabulary).
Figure 3: Variation of perplexity with size of training
data.
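
As a reminder of how perplexity behaves, the sketch below uses the standard definition (the exponential of the negative mean log-probability per word). The helper name and the probabilities are illustrative assumptions, not values from the experiments. It shows why a class with 2,983 equally likely members pushes the PP up: a text whose words are each predicted with probability 1/2,983 has a perplexity of about 2,983.

```python
import math


def perplexity(word_probs: list[float]) -> float:
    """Perplexity from the per-word probabilities a LM assigns to a text."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Every word drawn uniformly from a 2,983-member class.
uniform_city = perplexity([1 / 2983] * 4)
print(round(uniform_city))

# Mixing in well-predicted words lowers the perplexity.
mixed = perplexity([0.5, 0.5, 0.5, 1 / 2983])
print(mixed < uniform_city)
```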

3. ROBUSTNESS TO RARE EVENTS 

In this section, the behaviour of the LM with respect
to rare events is studied. The test-set of 2,040
utterances was split into two parts. The first part
contains 362 utterances, whose 351 NUs do not appear in
any of the partial training-sets. It is referred to
below as the unseen part of the test-set. The second
part contains the rest of the test-set (1,678
utterances, but only 257 NUs). The NUs in the partial
training-sets progressively cover the utterances of
the second part. For instance, the 100-utterance
training-set contains only 29 NUs, which cover 1,317
of these 1,678 utterances.
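
The split described above can be sketched as follows. The function names and the toy data are assumptions for illustration; the sketch relies on the fact that each partial training-set is a prefix of the whole training-set, so an NU absent from the whole set is absent from every partial set.

```python
def split_by_seen(test_nus, training_nus):
    """Split test NUs into those seen in training and those never seen."""
    known = set(training_nus)
    seen = [nu for nu in test_nus if nu in known]
    unseen = [nu for nu in test_nus if nu not in known]
    return seen, unseen


def covered_by_prefix(seen_part, training_nus, n):
    """Count seen-part utterances whose NU occurs in the first n
    training utterances (the n-utterance partial training-set)."""
    prefix = set(training_nus[:n])
    return sum(1 for nu in seen_part if nu in prefix)


# Toy data, invented for illustration only.
training = ["yes", "no", "from CITY-NAME", "yes"]
test = ["yes", "yes", "no", "uh at HOUR-NUMBER"]

seen_part, unseen_part = split_by_seen(test, training)
print(len(seen_part), len(unseen_part))           # 3 1
print(covered_by_prefix(seen_part, training, 1))  # 2
```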

Both the recognition and the overall understanding
results are quite similar for the 1,678 utterances
(82-85% SU), but they are very different for the unseen
part (33-46% SU); see Figure 4. The performance on the
unseen part is an indicator of the robustness of the
model. The reason for the low performance on the
unseen part is analysed further below.

The distinct NUs with more than three occurrences in
the global training-set were selected. Table 1 shows
how many of these NUs occur in each of the partial
training-sets. They were divided into groups according
to the kind of system request. The number of City and
Date NUs grows quickly until 5,000 utterances are
reached, and then very slowly, which indicates a kind
of saturation. The number of Time NUs, by contrast,
increases almost proportionally to the size of the
training-set.

Moreover, the NUs whose frequency in the training-set
is greater than 0.1% were compared with the ones in the
test-set. We observed that the selected NUs of the
training-set cover more than 90% of the test-set NUs
in the case of City and Date, but only 55% in the case
of Time. Therefore, the City and Date groups are
considered much more robust than the Time group: for
Time, the frequent NUs show no saturation, and the
training-set NUs cover the test-set poorly. This is
due to the high variability of time expressions.

Figure 4: Evaluation of the trained and untrained parts
of the test-DB.

Table 1: Number of frequent NUs in partial
training-sets.

4. INCREASING ROBUSTNESS BY ADDING N-GRAMS GENERATED BY GRAMMARS

Another coverage test was made using grammars. A
grammar was created (as explained in Section 4.1) on
the basis of the NUs in the 500-utterance training-set.
The sentences generated by the grammar covered 85% of
the NUs in the 20,511-utterance training-set. This
suggests that the robustness of a LM may be increased
by using a simple grammar derived from the common NUs
in the training material.

At first, the sentences generated by the grammar were
added to the training material. The resulting LMs did
not improve the results, because adding the
grammar-generated sentences greatly changes the
frequency distribution of the n-grams and reduces the
specificity of the training-set.

The adopted solution was to create the LM from a
database of n-grams rather than from a database of
generated sentences. This made it possible to add only
the n-grams that did not already exist, which do not
strongly affect the specificity. The tool used for
training the LMs was therefore changed so that it
could process both sentences and n-grams. Normally,
when n-grams are extracted from a sentence, they
automatically get all their contexts (the (n-1)-gram
that precedes the n-th word of the n-gram). If an
n-gram is added artificially, on the other hand, its
missing contexts must be added as well.
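
The bookkeeping for artificially added n-grams might look like the following sketch. The table layout (one counter per n-gram order) and the function name are assumptions, as is the choice to give a newly created context the same count as the added n-gram; the modified training tool itself is not described in enough detail to reproduce it.

```python
from collections import Counter


def add_ngram_with_contexts(tables, ngram, count=1):
    """Add an n-gram to the count tables unless it is already present.

    tables maps an order n to a Counter of n-gram tuples. When the
    n-gram is new, every missing context (its leading (n-1)-gram, then
    that context's own leading (n-2)-gram, and so on) is added too.
    Returns True if the n-gram was added, False if it already existed.
    """
    top = tables.setdefault(len(ngram), Counter())
    if ngram in top:
        return False              # existing n-grams are left untouched
    top[ngram] = count
    context = ngram[:-1]
    while context:
        table = tables.setdefault(len(context), Counter())
        if context not in table:  # only missing contexts are created
            table[context] = count
        context = context[:-1]
    return True


# Toy tables, invented for illustration only.
tables = {
    3: Counter({("at", "HOUR-NUMBER", "o'clock"): 4}),
    2: Counter({("at", "HOUR-NUMBER"): 4}),
    1: Counter({("at",): 4}),
}
added = add_ngram_with_contexts(tables, ("before", "HOUR-NUMBER", "o'clock"))
repeated = add_ngram_with_contexts(tables, ("at", "HOUR-NUMBER", "o'clock"))
print(added, repeated)  # True False
```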

4.1. GRAMMAR CREATION 

The grammars used in the following tests were created
manually, starting from a set of correct NUs selected
from a training-set. For each NU, semantic concepts
were identified, then a non-terminal was introduced
for each of these concepts, and finally each
non-terminal was generalised. For instance, in the
case of the Time NU:

"in the morning after seven o'clock",

the following non-terminal sequence could be
identified:

Part_of_Day Time_Specifier Time_Identifier.

Part_of_Day can also become "in the afternoon",
"in the evening" or "at lunch time"; Time_Specifier
can be expressed as "before" or "not earlier than";
other forms of Time_Identifier are "a quarter to
seven" and "twenty minutes past seven".
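
Sentence generation from such a rule can be sketched as below, using only the alternatives quoted in the example. The dictionary layout, the normalised non-terminal spellings and the function name are assumptions; the actual grammars are larger and were written by hand.

```python
from itertools import product

# Alternatives quoted in the example above; the real grammars
# contain more rules and more alternatives.
GRAMMAR = {
    "Part_of_Day": ["in the morning", "in the afternoon",
                    "in the evening", "at lunch time"],
    "Time_Specifier": ["after", "before", "not earlier than"],
    "Time_Identifier": ["seven o'clock", "a quarter to seven",
                        "twenty minutes past seven"],
}
RULE = ["Part_of_Day", "Time_Specifier", "Time_Identifier"]


def generate(rule, grammar):
    """Expand a sequence of non-terminals into every possible sentence."""
    alternatives = [grammar[non_terminal] for non_terminal in rule]
    return [" ".join(parts) for parts in product(*alternatives)]


sentences = generate(RULE, GRAMMAR)
print(len(sentences))  # 36
print(sentences[0])    # in the morning after seven o'clock
```

Even this small rule turns one observed NU into 36 sentences, which illustrates how a grammar built from a few hundred utterances can cover NUs never seen in the training material.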

At this point, both the 1,000-utterance training-set
(SPTS-1,000) and the global one (STS) were split
according to the system request. The analysis
concentrated on the City, Date, and Time requests, and
a grammar was created for the syntactically and
semantically correct NUs in SPTS-1,000. For instance,
there are 107 NUs in the SPTS-1,000 Date requests, and
2,483 NUs in STS.

One grammar was created for each of the Date and Time
request groups (Gr_D and Gr_T respectively), whereas
two were created for the City requests: Gr_C, which
generalises only NUs about departure and arrival
location, and Gr_Cdt, which also generalises date and
time, because the answers to the City requests could
also contain that information.
Table 2: Event composition of the training-sets.



4.2. CREATION OF GENERALISED LMS 

The n-grams extracted from a training-set and those
extracted from sentences generated by the grammar were
merged using the following technique. First, both the
training-set and the sentences generated by a grammar
were transformed into n-grams (n=3). Then three types
of events were considered: n-grams which are present
both in the training-set and in the generated
sentences (called usual events), n-grams which exist
only in the training-set (called rare events), and
n-grams which exist only in the generated sentences
(called unknown events).

In the new LM, the unknown events were added only
once, while the rare events kept their frequencies
(which are quite low). In many cases the number of
unknown events is much higher than the number of usual
events. For instance, in the case of Time there are
276 usual events obtained from SPTS-1,000, 36 rare
events and 1,748 unknown events. Therefore the
quantities of usual and unknown events are weighted by
multiplying them by a balance-factor. At this point, a
language model is created, and the best value of the
balance-factor (BaFa) is then determined empirically
by minimising the PP on the test-set.
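
The three event types and the weighting step can be sketched as follows. The text does not spell out exactly how the balance-factor is applied, so this sketch exposes two hypothetical weights, usual_weight and unknown_weight; setting both to the same value corresponds to a literal reading of a single balance-factor. All names and data are illustrative assumptions.

```python
from collections import Counter


def trigrams(sentence):
    """Return the word trigrams of a whitespace-tokenised sentence."""
    words = sentence.split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def merge_counts(train_counts, grammar_trigrams,
                 usual_weight, unknown_weight):
    """Merge training trigram counts with grammar-generated trigrams."""
    merged = {}
    for gram, count in train_counts.items():
        if gram in grammar_trigrams:
            merged[gram] = count * usual_weight  # usual event
        else:
            merged[gram] = count                 # rare event, count kept
    for gram in grammar_trigrams:
        if gram not in train_counts:
            merged[gram] = 1 * unknown_weight    # unknown event, added once
    return merged


# Toy data, invented for illustration only.
train_sentences = [
    "at HOUR-NUMBER o'clock please",
    "at HOUR-NUMBER o'clock please",
    "uh at HOUR-NUMBER o'clock",
]
grammar_sentences = ["at HOUR-NUMBER o'clock",
                     "before HOUR-NUMBER o'clock"]

train_counts = Counter(g for s in train_sentences for g in trigrams(s))
grammar_trigrams = {g for s in grammar_sentences for g in trigrams(s)}

merged = merge_counts(train_counts, grammar_trigrams,
                      usual_weight=2.0, unknown_weight=1.0)
for gram, weight in sorted(merged.items()):
    print(gram, weight)
```

In practice the weight would be swept over a range of values, a LM built from each merged table, and the value giving the lowest perplexity retained.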

Using Table 2, the event composition of each of the
studied LMs can be computed. For each request group,
several LMs were created by generalising SPTS-1,000
and STS, labelled part and all respectively in the
table. Note that a baseline LM considers only the
usual and rare events.

4.3. EXPERIMENTAL RESULTS 

In this section, the performance of the LMs that
include n-grams generated by a grammar is compared
with that of baseline LMs which do not make use of
grammar n-grams. These baseline LMs are reported in
the tables with the tag unused in the grammar column.
Table 3: Recognition and understanding results.

Table 3 shows that the LMs created using the grammars
obtain better results for the SPTS-1,000 LMs, while
for the STS LMs the increment is rather limited. In
particular, for Time and City the improvement of WA is
significant. The reasons are the high variability of
time expressions and the fact that the City requests
sometimes also include information about Date and
Time, especially in the first utterance to the system.
This is evident from the improvement obtained by using
the Gr_Cdt grammar, which even increases the
performance of the STS LM.

Moreover, merging SPTS-1,000 with grammar n-grams
improves the results, but they do not reach the
performance of the baseline STS LMs. One explanation
is that the grammars used do not model the highly
frequent extra-linguistic phenomena.

In addition, the perplexity of these LMs was studied.
For each group, the PP was analysed on the test-set
and also on the sentences generated by the grammar.
Table 4 shows the PP results for all the LMs tested on
the specific part of the test-set. Generalising the
LMs with grammar n-grams does not significantly affect
the PP.
Table 4: Perplexity results on the test-set.



A test-set of sentences generated by the grammars
does not give a correct insight into the behaviour of
the system on a test-set acquired from real users,
because the sentence distribution is artificial, but
it can show the degree of generalisation. These PP
results are reported in Table 5 and Table 6, split
according to the number of unknown events reported in
Table 2. The former shows the results for the small
grammars (Gr_C and Gr_D), the latter those for the
large ones (Gr_Cdt and Gr_T).
Table 5: Perplexity results on the grammar sentences.
Table 6: Perplexity values for Gr_Cdt and Gr_T.

In Table 5, a clear reduction of the PP can be
observed for the LMs which include grammar n-grams.
This reduction is higher for the LMs trained on
SPTS-1,000 (66%), but it is also considerable for the
LMs trained on STS (33%).

A similar comparison of the PP results presented in
Table 6, for the large sets of unknown events, shows
as expected a more significant reduction, ranging from
a minimum of 77% to a maximum of 94%.

5. CONCLUSIONS 

This paper shows that, in a task-oriented domain, a
LM trained on a small amount of training material
(1,000 utterances) acquired from naive users gives
rather good results, especially for the more common
NUs. This is because the common NUs are few but very
frequent.

Secondly, in a task-oriented domain with a very
limited training-set, the robustness of a language
model can be increased by using a simple grammar
derived from the common NUs in the training material.

A technique for generalising a language model by
adding n-grams generated by a grammar is described.
The advantage of this technique is shown by
experimental results. The improvements obtained with
this technique are especially good for language models
trained on a small amount of training material, and
therefore the technique can be used in the first
phases of the development of a LM for a new domain.
Even if the generalised LMs do not increase the
performance of a model trained on a large
training-set, the perplexity indicates a better
behaviour of the models in the case of rare events.